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Abstract. General purpose computing on graphic processing units 
(GPU) is a potential method of speeding up scientific computation with 
low cost and high energy efficiency. We experimented with the particle 
physics simulation toolkit Geant4 used at CERN to benchmark its geom- 
etry navigation functionality on a GPU. The goal was to find out whether 
Geant4 physics simulations could benefit from GPU acceleration and how 
difficult it is to modify Geant4 code to run in a GPU. 

We ported selected parts of Geant4 code to C99 & CUDA and im- 
plemented a simple 7-physics simulation utilizing this code to measure 
efficiency. The performance of the program was tested by running it on 
two different platforms: NVIDIA GeForce 470 GTX GPU and a 12-core 
AMD CPU system. Our conclusion was that GPUs can be a competi- 
tive alternate for multi-core computers but porting existing software in 
an efficient way is challenging. 

1 Introduction 

Graphic processing units (GPUs), also known as graphic adapters, or 3D cards, 
were originally expansion cards for personal computers. The purpose of these 
devices is to externalize graphics related computing from the main processor 
(CPU) and to increase the speed of these calculations with specialized hardware. 
The main driving force of GPU technology development has been the demand 
for fast 3D graphics in computer games. Polygon-based rendering is an ideal ap- 
plication for specialized parallel co-processor architectures. Subsequently more 
complex and customizable computation capabilities have been added, and re- 
cently, GPUs have developed into fully programmable general purpose parallel 
computers [13]. 

The modern GPUs can achieve high computation speeds, measured in float- 
ing point operations per second (FLOPS), compared to their size, cost and 
energy consumption. This makes them attractive for use also outside the com- 
puter game industry. These devices, and similar GPU-derived accelerators with 
no graphic processing capabilities, have recently developed interest in many 
fields of scientific computing and have been successfully used to speed up vari- 



ous applications. [TT] 

The high computing performance on GPUs is based on massive parallelism 
and is generally achievable only if the computation can be split into a large 
number of parallel tasks that require little global communication. GPU pro- 
gramming can be done using relatively high-level languages and toolkits with 
varying level of portability. Alternatives include NVIDIA's CUDA [1] and the 
newer cross-platform OpenCL [15] by Khronos group. 

In the current work we study whether GPUs are beneficial for particle physics 
simulations and how existing physics software can be ported to GPUs. The 
motivation for our research is the Large Hadron Collider particle accelerator 
that started running at CERN during in late 2009 and it is now producing 
data for physic experiments. The purpose of this accelerator is to provide new 
information on the basic structure of matter. From computing and data storage 
point of view, LHC experiments produce over 15 petabytes of data annually. 

2 Background 

2.1 Monte Carlo Simulations in Particle Physics 

Particle physics simulations have an essential role in contemporary experimen- 
tal nuclear and particle physics [3j. They are used in designing and calibrat- 
ing particle detectors and generating detectable signatures for hypothesized 
physics phenomena. Modern particle physics experiments require increasingly 
large amounts of Monte Carlo data from complex and computationally inten- 
sive simulations [3 . Examples of physics experiments requiring extensive Monte 
Carlo simulations include CMS [6] and ATLAS [2] experiments related to the 
Large Hadron Collider (LHC) [8] at CERN, Geneva, Switzerland. 

Monte Carlo particle physics simulations are also important in other fields, 
such as medical physics [161 [31 lllj , where an example application is the planning 
of maximally safe and accurate radiotherapy treatments. 

2.2 Geant4 

Geant4 [3! is "a toolkit for simulating the passage of particles through matter" . 
It can simulate a comprehensive set of different physical processes over a wide 
range of particle energies for a variety of long-lived particles. It is designed to 
offer a comprehensive set of functionalities needed to construct a (Monte Carlo) 
particle physics simulation program for a specific setup. It is a large object 
oriented system implemented in CH — h 

The Geant4 toolkit is used in many contemporary particle physics experi- 
ments, such as CMS [5] and ATLAS [2] at CERN and BaBar g] at SLAG It is 
also utilized in medical applications, for example in radio therapy planning [TT] . 
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2.3 The NVIDIA Fermi GPU Architecture 



This is a brief description of the hardware architecture (NVIDIA "Fermi") of 
the GPU used in our experiments. Its differences compared to contemporary 
multi-core CPU architectures are a key factor determining the performance of 
GPU programs. 

In the Fermi architecture [T], a device (e.g. one GPU) consists of several 
(e.g. 16 on the Tesla GPUs) streaming multiprocessors and each multiprocessor 
has a number (32) of cores. Each core executes a warp of threads (32 threads) 
in a manner called Single- Instruction, Multiple- Thread (SIMT) by NVIDIA. In 
SIMT, the threads in one warp always execute a common instruction with dif- 
ferent data. If the instruction paths of the threads diverge on a data-dependent 
branch instruction, the divergent branches are executed sequentially, that is, 
threads in one branch wait for the threads in the other branch to execute. SIMT 
can be viewed as a special case of the SIMD (Single-Instruction, Multiple-Data) 
architecture. 

The device has a large global memory with unsynchronized read-write ac- 
cess from the multiprocessors. There is also cached read-only memory, constant 
memory and special texture memory with automatic interpolation. Each mul- 
tiprocessor has a smaller amount of shared memory that can be used manually 
for local synchronization or automatically as LI cache for the global memory. In 
addition, each multiprocessor has a set of 32-bit registers. The Fermi GPUs are 
capable of native 32-bit and 64-bit IEEE-compliant floating point calculations. 

In order to efficiently exploit the full potential of a Fermi GPU, one must 
be able to parallelize the computational problem to thousands of threads that 
require little global synchronization. One multiprocessor can execute 32 x 32 = 
1024 threads concurrently and to hide latencies from global memory access with 
context switching, there should be multiple warps of suspended threads waiting 
on each multiprocessor. There are also other constraints, such as the limited 
number of registers on each multiprocessor. 

Because of these constraints and the nature of SIMT architecture, algorithms 
that can be implemented efficiently on parallel MIMD (Multiple-Instruction, 
Multiple-Data) machines, such as multi-core CPUs, do not necessarily execute 
efficiently on NVIDIA GPUs. 



3 Methods 

We study a minimalistic Monte Carlo physics simulation program for calculating 
where 7-particles interact with matter. A similar program could, in theory, 
be used as an GPU accelerated part of a more complex physics framework. 
The most computationally intensive parts of this program are the geometry 



navigation calculations (see section 3.2 1, that are done using Geant4 code. 
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Algorithm 1 Monte Carlo simulation for free path lengths of particles 



1. Generate W ~ Uniform((0, 1]) 

2. Set Y = - \ogW (now Y:s have distribution Exp(l)) [3J. 

3. Calculate T = G Xo 1 ^(r) as follows: 

Set 5 = 0, t = 0, j <- 
While 5j < Y", do 
Set j <- j + 1 

Compute the next intersection tj > and get fij using 

Geant4. 

Set Sj — Sj-i + (tj — tj-i)[Xj (may be oo) 
Set T = + Y - s >- 1 



3.1 The Simple Particle Physics Simulation 

Suppose a beam of N(0) particles moves through matter. The number of parti- 
cles left after distance x can be modeled with the differential equation 

^ = -N(xMx), (1) 

where fi = ctpNa/A is the mean free path in a material with density p, atomic 
weight A and a total cross section of a [9 a . The solution of ([!]) is 

N(x) = N(0)e~ J? = : N(0)e- G(x) (2) 

and the probability of a single particle interacting before traveling distance x is 
therefore 

F(X<x) = l-^ = l-e- G ^, (3) 

where X is a random variable representing the free path length of the parti- 
cle. Assuming G is invertible, we can generate random variables Z with this 
distribution from Exp(l)-distributed variables Y as 

Z = G-\Y). (4) 

Suppose we have a geometry of objects defining an attenuation coefficient 
p(x) in a point i£t 3 , and a particle with starting point x and a unit direction 
u>. The probability of the particle interacting before distance t > is 

P(T < t) = 1- e-^o /*(«o+v«)d v _. j _ e -G a!0 ,„(t)_ 

To generate random variables T with the correct distribution, one must be able 
to compute G~ 1 0J (i). Suppose the geometry is composed of a finite amount of 
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different materials that define regions with piecewise smooth boundaries. Then 
we may calculate the integral as 

M 

G Xo ^{t) = 2^(U — ti-i)ni + (t — t M )fi M+1 =: Sm + (t — £m)mm+i (6) 

t=l 

and 

G-Uz) = t M + '-XlLiiU-U-!)* = tM + z-Su^ 

Mm+i Mm+i 
where = to < t\ < . . . < ijy < t are the distances to the different material 
boundaries in the path of the particle and \ii are the mass attenuation coefficients 
of the corresponding materials: 

fi(x + toS) = im for all t G {t l _ 1 ,t i ). (8) 

We may therefore generate random variables T with distribution ([5| using an 
uniform random number generator and Geant4 as described in algorithm [T] 

3.2 Geant4 navigation 

In Geant4 terminology navigation is the part of the toolkit that can solve the 
following simply-stated, interrelated, geometrical problems: 

• If a particle is in point A, inside which geometrical objects, volumes, is it? 

• Will the particle hit a volume boundary, that is, will it exit a volume or 
enter a new one on its way to point B? If so, where, and what is the 
normal of the boundary surface at that point? 

These are subproblems for calculating the track of a particle that can interact 
with the different materials it passes through. To solve these problems efficiently 
to calculate the full track, one stores the navigation history of a particle. The 
navigation history describes the volumes, inside which the particle currently is. 
A general Geant4 geometry comprises a hierarchy of nested volumes and the 
navigation history may indeed contain many levels. The full state, including the 
navigation history, of the particle is stored in the navigator, which also has other 
information, such as the normals of the last crossed surface boundaries. 

3.3 Porting Geant4 Navigation to CUDA 

The involved parts of Geant4 were first isolated from the full Geant4 code. They 
were subsequently translated from C++ to C99 by replacing C++ classes with 
C struct s, their methods with functions with this-pointers as parameters and 
replacing C++ standard library containers with C-arrays. C++ virtual function 
were implemented using ''fork" functions that call the appropriate implementa- 
tion based on the object type. This was because of the lack of function pointer 
support in the CUDA version we used. The reason to translate the program to 
C was to allow the program to be also compiled as OpenCL. 
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3.4 Benchmark Program 

The basic operation principle of the benchmark program is listed as algorithm 
[2j The execution phase ([5| is timed. All random numbers (in phase |3| are 
generated using the Mersenne Twister |14| pseudo-random number generator 
from the Boost C++ library. The CPU version of the program can be described 
as leaving out the transfer phases [4] and [6j The single-threaded CPU version of 
the program does phase [5] sequentially. A multi-threaded CPU version that uses 
OpenMP[7] to parallelize phase [5] is also implemented. The GPU version of the 
program uses CUDA. 



Algorithm 2 Basic phases of the benchmark program 

1. Load the geometry and transfer it to GPU memory. 

2. Generate N particles: directions and initial positions (u:,Xo)i,i = 
l,...,N. 

3. Generate an (exponentially distributed) "lifetime" number Y t for each par- 
ticle. 

4. Transfer initial particle states and lifetimes to GPU memory. 

5. Calculate the free path lengths Tj from Yi, [oj, Xo)i as in algorithm [l] for 
each particle in parallel. 

6. Transfer the free path length data Ti back to host memory. 



In this program, the attenuation coefficients fi are calculated from the den- 
sities p in the model geometry as 

fi(x) = p(x)hai,e, (9) 

where hai,e is the mass attenuation coefficient of photons in aluminum for given 
photon energy E. In our tests, all particles have the same energy, 1 MeV, that, 
according to [5], corresponds to a mass attenuation coefficient [aai.e — 0.06146 
cm 2 /g. 

3.5 Test inputs 

We use two different test geometries and two different particle distributions in 
order to demonstrate how the relative performances of CPU and GPU versions 
of the same code can have heavy input dependence. In both cases we have 
N = 100000 input particles. 

In the Random particle setup the directions of the particles, <jJ, are uniformly 
distributed on the unit sphere S 2 . The lifetimes of the particles are generated 
from the exponential distribution. The output of the program is a random sam- 
ple from the distribution of the free path lengths of the particles. In the Regular 
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setup, the particles travel towards a regular 400x250 grid and are input to the 
program in column-major order. Instead of generating Exp(l) distributed life- 
times, all particles are assigned an equal lifetime of 1. In practice, the output in 
this case is a 400x250 volume raytraced bitmap image of the geometry. 

The first test geometry, Spheres, consists of 8000 aluminium balls placed on 
a regular grid in a 5 x 5 x 5 meter vacuum cube. All particles start from the center 
of the geometry. The other, more complex, geometry, CMS, is based on a model 
of the CMS detector [6]. It is constructed by extracting all supported types of 
solids, namely boxes, tubes and cones, from the original model. The densities 
of the materials in the CMS model are also imported to the program. 

4 Results 

The GPU version of the program described in Section [374] was tested on a Linux 
desktop with an NVIDIA GeForce 470 GTX GPU. Both CPU programs, the 
single- and multi-threaded versions, were run on a 12-core AMD CPU system (2 
x 6-Core AMD Opteron 2427). The execution times of the program in different 
test cases are shown in Table [TJ In all test cases, the GPU was faster than a 
single CPU core. However, the speedup varied dramatically depending on the 
input. In the most beneficial example, the GPU was over 13 times faster than 
the single-threaded CPU version and 60% faster than the 12-core CPU system. 
In the most realistic example (CMS, Random), the GPU was only 60% faster 
than the single-threaded CPU version and the 12-core system performed 7 times 
better than this. 

In addition to the run times, we also measured energy usage of the test 
hardware. The results are shown in Figure [4j The idle power consumption of 
GPU is around 20 W that means around 25% increase. The peak power of GPU 
is around 165W that more than doubles the consumption of the desktop. 

300 




Server Desktop & GPU 



Figure 1: Power usage of test hardware 
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Table 1: Performance of the benchmark simulation 



Geometry 


Particles 


Platform 


Exec, time (ms) 


Speedup factor 






CPU, 1-thread 


442 


1 


Spheres 


Regular 


CPU, 12-thread 


54 


8.2 






GPU 


33 


13.3 






CPU, 1-thread 


458 


1 


Spheres 


Random 


CPU, 12-thread 


56 


8.2 






GPU 


114 


4.0 






CPU, 1-thread 


22431 


1 


CMS 


Regular 


CPU, 12-thread 


1862 


12.0 






GPU 


8881 


2.5 






CPU, 1-thread 


15493 


1 


CMS 


Random 


CPU, 12-thread 


1316 


11.8 






GPU 


9977 


1.6 



5 Discussion 

Execution path divergence within a warp of threads is a key performance factor 
on the SIMT architecture [T]. In theory, heavy divergence can potentially slow 
down the program by up to a factor 32. The level of actual run-time divergences 
(that determines the performance) depends not only on the number of branching 
instructions in the program, but also on its input. 

Geant4 Navigation code is complex and heavily branching and thus as such 
not optimal for the SIMT architecture. For regular enough input, however, the 
number level of divergence is low and the program performs well on the GPU. 
Our more realistic inputs do not exhibit such regularity and the corresponding 
GPU performance is poor. The performance is also be affected by other factors 
than branching, such as register space, memory access patterns and double 
precision floating point performance. CUDA profiler results showed that register 
space constraints were indeed a bottleneck factor limiting the performance of 
our application. This is probably due to the relatively large per-particle Geant4 
navigator data structures required in our implementation. 

We also tried more complex load balancing schemes on the GPU, such as 
using CUDPP[10] to periodically remove empty slots (of finished particles) from 
the warps with parallel stream compaction operations and refilling the GPU with 
new particles to maximize occupancy. Again, depending on the test case, this 
resulted to some speedup, but no dramatic performance increases were observed. 

GPU implementations of Monte Carlo particle physics simulations have also 
been studied by [11] and [12]. Jahnke et.al. [11] have implemented a Geant4- 
based software for radiotherapy calculations on GPUs using CUDA. They report 
a 10-30 times speedup compared to a single core CPU when using an NVIDIA 
GTX 8800 graphics card. Jia et al. [T2] have implemented a dose planning 
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method package, also for radiotherapy, in CUDA, and report 5-7 times speedups 
when an NVIDIA Tesla C1060 GPU is compared against a 2.27GHz Intel Xcon 
CPU. 

Our results are not that promising. Different test hardware can explain re- 
sults a lot, but it also seems that efficient implementation of GPU algorithms 
is also essential. The problem in this approach for hardware benchmarking is 
that it is very difficult to compare the performance of different algorithms on 
different hardware. 

It seems that an efficient GPU implementation of Geant4 in this context 
would indeed require rewriting the algorithms for the target GPU architecture, 
instead of porting code to CUDA as such. Especially, our results show, that it 
is not sufficient for an algorithm to be easily parallelizable on CPU (MIMD) 
architectures to yield an efficient GPU implementation. 

6 Conclusions 

We ported a piece of code from a large particle physics simulation toolkit to a 
general purpose GPU and compared its performance in a GPU and a 12 core 
CPU server. 

The results indicated that the performance on the GPU may be heavily 
input-dependent. In our test cases, the GPU program, when compared to a 
single-threaded CPU version, was faster by a factor 1.6 - 13, depending on the 
test case. The corresponding speedup factor on the 12-core CPU system was 
8-12 and it outperformed the GPU in most cases. 

The algorithms were originally designed for CPUs and quite straightfor- 
wardly ported to GPU. Our results indicated that an efficient GPU implemen- 
tation of Geant4 in the particle physics context would probably require rewriting 
the algorithms for the target GPU architecture. 

Based on our case, we can say that GPU computing can be a cost-effective 
solution for simple enough problems but accelerating complex software packages 
efficiently using GPGPU is not necessarily a feasible alternative, even if the 
software was easily parallelizable on CPU systems. 
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